Simpson's Paradox

Use admission_data.csv for this exercise.

# Load and view first few lines of dataset
import pandas as pd
import numpy as np

df = pd.read_csv('admission_data.csv')
df.head()

Proportion and admission rate for each gender

# Proportion of students that are female
len(df[df['gender'] == 'female'])/df.shape[0]

0.514

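# Proportion of students that are male
# (`_` holds the previous cell's output in IPython/Jupyter; this assumes
# gender takes only the values 'female' and 'male')
1 - _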

0.486

# Admission rate for females
df[df['gender'] == 'female']['admitted'].mean()

0.28793774319066145

# Admission rate for males
df[df['gender'] == 'male']['admitted'].mean() # overall, the female admission rate appears lower

0.48559670781893005

Proportion and admission rate for physics majors of each gender

# What proportion of female students are majoring in physics?

# This is the conditional probability P(physics | female):
# P(female and physics) / P(female).
# Both proportions share the same denominator (the total number of students),
# so it cancels and the ratio of the two counts is enough.

len(df.query('gender == "female" and major == "Physics"'))/len(df[df['gender'] == 'female'])

0.12062256809338522
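The cancellation described in the comments above can be checked with plain arithmetic. This is only a sketch: the three counts are not read from admission_data.csv but back-calculated from the printed proportions, assuming the file has 1000 rows (any row count that reproduces the same proportions would behave the same way).

```python
import math

# Hypothetical counts, back-calculated from the printed outputs under the
# assumption of 1000 rows: 0.514 * 1000 = 514 females, and
# 0.12062256809338522 * 514 = 62 female physics majors.
n_total = 1000
n_female = 514
n_female_physics = 62

# Long form: P(female and physics) / P(female)
p_female_and_physics = n_female_physics / n_total
p_female = n_female / n_total
p_physics_given_female = p_female_and_physics / p_female

# Short form: the shared denominator n_total cancels
ratio_of_counts = n_female_physics / n_female

print(round(p_physics_given_female, 4))  # 0.1206
print(math.isclose(p_physics_given_female, ratio_of_counts))  # True
```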

# What proportion of male students are majoring in physics?

len(df.query('gender == "male" and major == "Physics"'))/len(df[df['gender'] == 'male']) # a far larger share of males major in physics

0.92592592592592593

# Admission rate for female physics majors

# That is, the proportion of female physics applicants who are admitted
fem_adm_phys = len(df.query('gender == "female" and major == "Physics" and admitted == True'))
fem_phys = len(df.query('gender == "female" and major == "Physics"'))

fem_adm_phys/fem_phys

0.74193548387096775

# Admission rate for male physics majors

# That is, the proportion of male physics applicants who are admitted
male_adm_phys = len(df.query('gender == "male" and major == "Physics" and admitted == True'))
male_phys = len(df.query('gender == "male" and major == "Physics"'))

male_adm_phys/male_phys # the female admission rate in physics is higher

0.51555555555555554

Proportion and admission rate for chemistry majors of each gender

# What proportion of female students are majoring in chemistry?
len(df.query('gender == "female" and major == "Chemistry"'))/len(df[df['gender'] == 'female'])

0.87937743190661477

# What proportion of male students are majoring in chemistry?
len(df.query('gender == "male" and major == "Chemistry"'))/len(df[df['gender'] == 'male']) # a far smaller share of males major in chemistry

0.07407407407407407

# Admission rate for female chemistry majors
fem_adm_chem = len(df.query('gender == "female" and major == "Chemistry" and admitted == True'))
fem_chem = len(df.query('gender == "female" and major == "Chemistry"'))

fem_adm_chem/fem_chem

0.22566371681415928

# Admission rate for male chemistry majors
male_adm_chem = len(df.query('gender == "male" and major == "Chemistry" and admitted == True'))
male_chem = len(df.query('gender == "male" and major == "Chemistry"'))

male_adm_chem/male_chem # the male admission rate is lower in chemistry as well as in physics

0.1111111111111111
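The four per-gender, per-major admission rates computed cell by cell above could also be obtained in a single groupby call. The sketch below is self-contained, so it does not read admission_data.csv. Instead it builds a stand-in frame, df_demo (a hypothetical name), from counts back-calculated from the printed rates under the assumption that the file has 1000 rows.

```python
import pandas as pd

# Hypothetical counts per (gender, major, admitted), back-calculated from
# the rates printed above assuming 1000 rows; not read from the CSV.
counts = {
    ("female", "Physics", True): 46,
    ("female", "Physics", False): 16,
    ("female", "Chemistry", True): 102,
    ("female", "Chemistry", False): 350,
    ("male", "Physics", True): 232,
    ("male", "Physics", False): 218,
    ("male", "Chemistry", True): 4,
    ("male", "Chemistry", False): 32,
}

# Expand the counts into one row per student
rows = [key for key, n in counts.items() for _ in range(n)]
df_demo = pd.DataFrame(rows, columns=["gender", "major", "admitted"])

# Mean of the boolean column is the admission rate; count is the group size
summary = df_demo.groupby(["gender", "major"])["admitted"].agg(["mean", "count"])
print(summary)
```

With the real data, the same call on df, that is df.groupby(['gender', 'major'])['admitted'].mean(), should reproduce the four rates shown above.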

Admission rate for each major

# Admission rate for physics majors
df[df['major'] == "Physics"]['admitted'].mean()

0.54296875

# Admission rate for chemistry majors
df[df['major'] == "Chemistry"]['admitted'].mean()

0.21721311475409835

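Far more females applied to chemistry (about 88% of female applicants versus about 7% of male applicants), and chemistry had a much lower admission rate than physics (about 22% versus 54%). As a result, the overall admission rate for females (about 29%) was lower than for males (about 49%), even though females had the higher admission rate within each major: about 74% versus 52% in physics and about 23% versus 11% in chemistry. This reversal between the aggregate comparison and the within-group comparisons is known as Simpson's Paradox.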
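One way to see where the reversal comes from is to write each gender's overall admission rate as a weighted average of its per-major rates, with the weights being the share of that gender applying to each major. The sketch below uses only the values printed in the outputs above. The names share, rate and overall_rate are illustrative and not part of the original notebook.

```python
# P(major | gender), copied from the outputs above
share = {
    "female": {"Physics": 0.12062256809338522, "Chemistry": 0.87937743190661477},
    "male": {"Physics": 0.92592592592592593, "Chemistry": 0.07407407407407407},
}

# P(admitted | gender, major), copied from the outputs above
rate = {
    "female": {"Physics": 0.74193548387096775, "Chemistry": 0.22566371681415928},
    "male": {"Physics": 0.51555555555555554, "Chemistry": 0.1111111111111111},
}

def overall_rate(gender):
    """Weighted average of per-major admission rates for one gender."""
    return sum(share[gender][m] * rate[gender][m] for m in share[gender])

for g in ("female", "male"):
    print(g, round(overall_rate(g), 4))
# female 0.2879
# male 0.4856
```

Females have the higher rate in each major, but their weight falls almost entirely on the low-rate major (chemistry), while the male weight falls almost entirely on the high-rate major (physics). That difference in weights is what pulls the female weighted average below the male one.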

Output of df.head() (first five rows of admission_data.csv):

   student_id  gender      major  admitted
0       35377  female  Chemistry     False
1       56105    male    Physics      True
2       31441  female  Chemistry     False
3       51765    male    Physics      True
4       53714  female    Physics      True